06. Policy Gradient Quiz

Suppose we are training an agent to play a computer game. There are only two possible actions:

0 = Do nothing,
1 = Move

There are three time-steps in each game, and our policy is completely determined by one parameter \theta, such that the probability of "moving" is \theta and the probability of doing nothing is 1-\theta.

Initially \theta=0.5. Three games are played, with the following results:
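
From the policy definition above, the gradient of the log-probability of each action (the score function) is:

d/d\theta log \pi(1|\theta) = 1/\theta
d/d\theta log \pi(0|\theta) = -1/(1-\theta)

At \theta = 0.5 these evaluate to +2 for "Move" and -2 for "Do nothing"; these two factors are all that is needed for the gradient questions below.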

Game 1:
actions: (1,0,1)
rewards: (1,0,1)

Game 2:
actions: (1,0,0)
rewards: (0,0,1)

Game 3:
actions: (0,1,0)
rewards: (1,0,1)

Computing the policy gradient

What are the future rewards for the first game?

Recall the results for game 1 are:

actions: (1,0,1)
rewards: (1,0,1)

SOLUTION: (2,1,1)
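
Here the "future reward" at step t is the sum of the rewards from step t to the end of the game (the reward-to-go), so for game 1:

(1+0+1, 0+1, 1) = (2, 1, 1)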

What is the policy gradient computed from the second game, using future rewards?

actions: (1,0,0)
rewards: (0,0,1)

SOLUTION: -2
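
Working this out: the future rewards for game 2 are (0+0+1, 0+1, 1) = (1, 1, 1). Weighting each step's score by its future reward, the "move" at step 1 contributes +2*1 and the two "do nothing" actions contribute -2*1 each, so the gradient is 2 - 2 - 2 = -2.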

Which of these statements are true regarding the third game? Recall that the results for the third game are:

actions: (0,1,0)
rewards: (1,0,1)

SOLUTION:
  • The contribution to the gradient from the second and third steps cancel each other
  • The computed policy gradient from this game is negative
  • Using the total reward vs. the future reward gives the same policy gradient in this game
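
Working through game 3: the future rewards are (2, 1, 1), so the per-step contributions are -2*2, +2*1, and -2*1, which sum to -4 (negative), with steps 2 and 3 cancelling each other. Using the total reward of 2 at every step instead gives -2*2 + 2*2 - 2*2 = -4 as well.

Below is a minimal Python sketch that reproduces all of these numbers; the helper names reward_to_go, score, and game_gradient are purely illustrative, not from any library.

# Policy: P(move) = theta, P(do nothing) = 1 - theta.
# Score function: d/dtheta log pi(a|theta) = 1/theta if a == 1, else -1/(1 - theta).

theta = 0.5

games = [
    {"actions": (1, 0, 1), "rewards": (1, 0, 1)},  # game 1
    {"actions": (1, 0, 0), "rewards": (0, 0, 1)},  # game 2
    {"actions": (0, 1, 0), "rewards": (1, 0, 1)},  # game 3
]

def reward_to_go(rewards):
    # Future reward at step t: sum of rewards from step t to the end of the game.
    return [sum(rewards[t:]) for t in range(len(rewards))]

def score(action):
    # Gradient of log pi(action | theta) for this one-parameter policy.
    return 1.0 / theta if action == 1 else -1.0 / (1.0 - theta)

def game_gradient(actions, rewards, use_future_rewards=True):
    # Policy gradient estimate from a single game.
    if use_future_rewards:
        weights = reward_to_go(rewards)
    else:
        weights = [sum(rewards)] * len(rewards)  # same total reward at every step
    return sum(score(a) * w for a, w in zip(actions, weights))

print(reward_to_go(games[0]["rewards"]))                        # [2, 1, 1]
print(game_gradient(games[1]["actions"], games[1]["rewards"]))  # -2.0
print(game_gradient(games[2]["actions"], games[2]["rewards"]))  # -4.0 (negative)
print(game_gradient(games[2]["actions"], games[2]["rewards"],
                    use_future_rewards=False))                  # -4.0 with total reward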